In [1]:
%autosave 0
from IPython.display import HTML, display
display(HTML('<style>.container { width:100%; } </style>'))


Autosave disabled

Spam Detection Using a Support Vector Machine

The process of creating a spam detector using a Support Vector Machine is split up into five steps.

  • Create a set of the most common words occurring in spam and ham (i.e. non-spam) emails.
  • Transform every mail into a frequency vector: For every word in the set of most common words, the frequency vector stores the frequency of this word in the respective mail.
  • For every word in the list of most common words, compute the inverse document frequency.
  • Compute the feature matrix by transforming the frequency vectors into vectors that contain the product of the term frequency with the inverse document frequency.
  • Train and test an SVM using this feature matrix.

Step 1: Create the Set of Common Words


In [2]:
import os
import re
import numpy as np
import math

In [3]:
from collections import Counter

The directory https://github.com/karlstroetmann/Artificial-Intelligence/tree/master/Python/EmailData contains 960 emails that are divided into four subdirectories:

  • spam-train contains 350 spam emails for training,
  • ham-train contains 350 non-spam emails for training,
  • spam-test contains 130 spam emails for testing,
  • ham-test contains 130 non-spam emails for testing.

I found this data on the page http://openclassroom.stanford.edu/MainFolder/DocumentPage.php?course=MachineLearning&doc=exercises/ex6/ex6.html, which is provided by Andrew Ng.

We declare some variables so that this notebook can be adapted to other data sets.


In [4]:
spam_dir_train = 'EmailData/spam-train/'
ham__dir_train = 'EmailData/ham-train/'
spam_dir_test  = 'EmailData/spam-test/'
ham__dir_test  = 'EmailData/ham-test/'
Directories    = [spam_dir_train, ham__dir_train, spam_dir_test, ham__dir_test]

The function $\texttt{get_words_set}(\texttt{fn})$ takes a filename $\texttt{fn}$ as its argument. It reads the file and returns the set of all words that occur in this file. The words are converted to lower case.


In [5]:
def get_words_set(fn):
    with open(fn) as file:
        text = file.read()
        text = text.lower()
        return set(re.findall(r"[\w']+", text))
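
To make the tokenization concrete, here is what the regular expression above extracts from a small made-up string (the string is not part of the data set):

example = "Win a FREE prize now! Don't wait."
set(re.findall(r"[\w']+", example.lower()))
# returns {'win', 'a', 'free', 'prize', 'now', "don't", 'wait'}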

The function $\texttt{read_all_files}()$ reads all files contained in the directories that are stored in the list Directories. It returns a Counter that maps every word $w$ to the number of files containing $w$.


In [6]:
def read_all_files():
    Words = Counter()
    for directory in Directories:
        for file_name in os.listdir(directory):
            Words.update(get_words_set(directory + file_name))
    return Words

Common_Words is a numpy array of the 2500 most common words found in all of our emails.


In [11]:
M            = 2500             # number of the most common words to use
Word_Counter = read_all_files()
Common_Words = np.array(list({ w for w, _ in Word_Counter.most_common(M) }))

In [12]:
Common_Words


Out[12]:
array(['hundr', 'handbook', 'san', ..., 'friday', 'instructions', 'act'],
      dtype='<U18')

Step 2: Transform Files into Frequency Vectors

Index_Dict is a dictionary that maps from the most common words to their index in the array Common_Words.


In [13]:
Index_Dict = { w: i for i, w in enumerate(Common_Words) }
Index_Dict


Out[13]:
{'hundr': 0,
 'handbook': 1,
 'san': 2,
 'imagination': 3,
 'pass': 4,
 'effectively': 5,
 ...}

The function $\texttt{transform_to_vector}(L)$ takes a list of words $L$ and transforms this list into a vector $\mathbf{v}$. If $\texttt{Common_Words}[i] = w$, then $\mathbf{v}[i]$ specifies the number of times that $w$ occurs in $L$.


In [14]:
def transform_to_vector(L):
    Result = np.zeros(len(Common_Words))
    for w in L:
        if w in Index_Dict:
            Result[Index_Dict[w]] += 1
    return Result
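
For illustration, applying the function to a small made-up word list yields a vector with one component per common word; words that are not in Index_Dict are simply ignored:

# Toy example: 'free' is counted twice, provided that it is one of the common words;
# a word that does not occur in Index_Dict does not change the vector at all.
v = transform_to_vector(['free', 'money', 'free', 'qwertyuiop'])
v.shape    # (2500,): one component for every common word
v.sum()    # total number of common words in the list, counted with multiplicity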

The function $\texttt{get_word_vector}(\texttt{fn})$ takes a filename $\texttt{fn}$, reads the specified file, and transforms its contents into a frequency vector.


In [15]:
def get_word_vector(fn):
    with open(fn) as file:
        text = file.read()
        text = text.lower()
        return transform_to_vector(re.findall(r"[\w']+", text))

Step 3: Compute the Inverse Document Frequency

In natural language processing, the notion term is used as a synonym for word. Given a term $t$ and a document $d$, the term frequency $\texttt{tf}(t, d)$ is defined as $$ \texttt{tf}(t, d) = \frac{d.\texttt{count}(t)}{\texttt{len}(d)}, $$ where $d.\texttt{count}(t)$ counts the number of times $t$ appears in $d$ and $\texttt{len}(d)$ is the length of the list representing $d$.
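
A direct transcription of this definition into Python could look as follows; the word list in the last line is a made-up toy document, not one of the emails:

def term_frequency(t, d):
    """Compute tf(t, d) for a term t and a document d given as a list of words."""
    return d.count(t) / len(d)

term_frequency('free', ['free', 'money', 'free', 'offer'])   # 2 / 4 = 0.5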

A corpus is a set of documents. Given a term $t$ and a corpus $\mathcal{C}$, the inverse document frequency $\texttt{idf}(t,\mathcal{C})$ is defined as $$ \texttt{idf}(t,\mathcal{C}) = \ln\left(\frac{\texttt{card}(\mathcal{C}) + 1}{\texttt{card}\bigl(\{ d \in \mathcal{C} \mid t \in d \}\bigr) + 1}\right). $$ The addition of $1$ in both numerator and denominator is called Laplace smoothing. It prevents a division by zero for terms $t$ that do not occur in any document of the corpus $\mathcal{C}$.
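
The corresponding computation, sketched for a corpus represented as a list of word sets (again a made-up toy corpus, not the email data):

def inverse_document_frequency(t, Corpus):
    """Compute the smoothed idf(t, Corpus), where Corpus is a list of documents
       and every document is represented as a set of words.
    """
    number_containing_t = sum(1 for d in Corpus if t in d)
    return math.log((len(Corpus) + 1) / (number_containing_t + 1))

Corpus = [{'free', 'money'}, {'meeting', 'agenda'}, {'free', 'offer'}]
inverse_document_frequency('free',  Corpus)   # log(4 / 3) ≈ 0.288
inverse_document_frequency('zebra', Corpus)   # log(4 / 1) ≈ 1.386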

Step 4: Compute the Feature Matrix

The function $\texttt{feature_matrix}(\texttt{spam_dir}, \texttt{ham_dir})$ takes two directories that contain spam and ham, respectively. It computes a matrix $X$ and a vector $Y$, where $X$ is the feature matrix and for every row $r$ of the feature matrix, $Y[r]$ is 1 if the mail is ham and 0 if it's spam.

The way $X$ is computed here is quite inefficient; it would have been better to initialize $X$ as a matrix of shape $(N,M)$, where $N$ is the number of mails and $M$ is the number of common words. A sketch of such a variant is shown after the next cell.


In [16]:
def feature_matrix(spam_dir, ham_dir):
    X = []
    Y = []
    for fn in os.listdir(spam_dir):
        X.append(get_word_vector(spam_dir + fn))
        Y.append(0)                               # spam mails are labelled 0
    for fn in os.listdir(ham_dir):
        X.append(get_word_vector(ham_dir + fn))
        Y.append(1)                               # ham mails are labelled 1
    X = np.array(X)
    Y = np.array(Y)
    return X, Y
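
The more efficient variant mentioned above would preallocate $X$ with the shape $(N, M)$ and fill it row by row. The following sketch shows this; the name feature_matrix_prealloc is made up and the function is not used in the rest of this notebook:

def feature_matrix_prealloc(spam_dir, ham_dir):
    Spam_Files = [spam_dir + fn for fn in os.listdir(spam_dir)]
    Ham_Files  = [ham_dir  + fn for fn in os.listdir(ham_dir)]
    All_Files  = Spam_Files + Ham_Files
    X = np.zeros((len(All_Files), len(Common_Words)))   # shape (N, M)
    Y = np.zeros(len(All_Files))
    for row, fn in enumerate(All_Files):
        X[row, :] = get_word_vector(fn)                 # fill the preallocated row in place
    Y[len(Spam_Files):] = 1                             # spam keeps label 0, ham gets label 1
    return X, Y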

We convert the training set into a feature matrix.


In [17]:
%%time
X_train, Y_train = feature_matrix(spam_dir_train, ham__dir_train)


CPU times: user 163 ms, sys: 20.7 ms, total: 184 ms
Wall time: 183 ms

Up to now, the feature matrix contains only the raw term counts. Next, we multiply every column by the inverse document frequency of the corresponding word.


In [18]:
N, _ = X_train.shape
IDF  = {}
for w, i in Index_Dict.items():
    IDF[w] = np.log((N + 1) / (Word_Counter[w] + 1))
    X_train[:, i] = X_train[:, i] * IDF[w]
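
The same scaling can also be written as a single vectorized multiplication. The following function is an equivalent alternative to the loop above and is shown only for comparison; it is not used in this notebook:

def scale_with_idf(X):
    """Multiply every column of X by the idf value of the corresponding common word."""
    idf_vector = np.array([IDF[w] for w in Common_Words])   # idf values in the order of Common_Words
    return X * idf_vector                                   # numpy broadcasts the row vector over all rows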

We build the feature matrix for the test set and scale it with the inverse document frequencies that were computed from the training data.


In [19]:
X_test, Y_test = feature_matrix(spam_dir_test, ham__dir_test)

In [20]:
for w, i in Index_Dict.items():
    X_test[:, i] = X_test[:, i] * IDF[w]

Step 5: Train and Test a Support Vector Machine


In [21]:
import sklearn.svm as svm

We train a support vector machine with a linear kernel and compute its accuracy on the training data.


In [22]:
M = svm.SVC(kernel='linear', C=100000)
M.fit  (X_train, Y_train)
M.score(X_train, Y_train)


Out[22]:
1.0

Compute the accuracy for the test data.


In [23]:
M.score(X_test, Y_test)


Out[23]:
0.9884615384615385
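
To classify a new mail with the trained model, the same preprocessing has to be applied: read the file, build the count vector, and scale it with the IDF values computed from the training data. The following sketch does this for a hypothetical file new_mail.txt; both the file name and the helper classify are made up:

def classify(fn):
    """Return 'ham' or 'spam' for the mail stored in the file fn."""
    v = get_word_vector(fn)              # raw term counts for the common words
    for w, i in Index_Dict.items():
        v[i] *= IDF[w]                   # scale with the idf values from the training data
    return 'ham' if M.predict([v])[0] == 1 else 'spam'

# classify('new_mail.txt')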

In [ ]: